Goto

Collaborating Authors

 cheeger constant



A Proof of Proposition 2.2: additive expansion proposition

Neural Information Processing Systems

We first define the restricted Cheeger constant in the link prediction task. Then, according to Proposition 2.1, we have: Then, we can draw the same conclusion with Eq.12, and the Thus, Eq.16 can be simplified to: "sites" Based on the Eq.15 and Eq.17, we can rewrite L The inequality holds due to the assumption. Knowledge discovery: In the 5 random experiments, we add 500 pseudo links in each iteration. The metadata information of the nodes are all strongly relevant to "Linux" Both papers focus on the "malware"/"phishing" under the topic "Computer security". The detailed result of the case study is shown in Table 6.


FoSR: First-order spectral rewiring for addressing oversquashing in GNNs

arXiv.org Artificial Intelligence

Graph neural networks (GNNs) are able to leverage the structure of graph data by passing messages along the edges of the graph. While this allows GNNs to learn features depending on the graph structure, for certain graph topologies it leads to inefficient information propagation and a problem known as oversquashing. This has recently been linked with the curvature and spectral gap of the graph. On the other hand, adding edges to the message-passing graph can lead to increasingly similar node representations and a problem known as oversmoothing. We propose a computationally efficient algorithm that prevents oversquashing by systematically adding edges to the graph based on spectral expansion. We combine this with a relational architecture, which lets the GNN preserve the original graph structure and provably prevents oversmoothing. We find experimentally that our algorithm outperforms existing graph rewiring methods in several graph classification tasks. Graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2008) are a broad class of models which process graph-structured data by passing messages between nodes of the graph. Due to the versatility of graphs, GNNs have been applied to a variety of domains, such as chemistry, social networks, knowledge graphs, and recommendation systems (Zhou et al., 2020; Wu et al., 2020). GNNs broadly follow a message-passing framework, meaning that each layer of the GNN aggregates the representations of a node and its neighbors, and transforms these features into a new representation for that node.


Expander Graph Propagation

arXiv.org Artificial Intelligence

Deploying graph neural networks (GNNs) on whole-graph classification or regression tasks is known to be challenging: it often requires computing node features that are mindful of both local interactions in their neighbourhood and the global context of the graph structure. GNN architectures that navigate this space need to avoid pathological behaviours, such as bottlenecks and oversquashing, while ideally having linear time and space complexity requirements. In this work, we propose an elegant approach based on propagating information over expander graphs. We leverage an efficient method for constructing expander graphs of a given size, and use this insight to propose the EGP model. We show that EGP is able to address all of the above concerns, while requiring minimal effort to set up, and provide evidence of its empirical utility on relevant graph classification datasets and baselines in the Open Graph Benchmark. Importantly, using expander graphs as a template for message passing necessarily gives rise to negative curvature. While this appears to be counterintuitive in light of recent related work on oversquashing, we theoretically demonstrate that negatively curved edges are likely to be required to obtain scalable message passing without bottlenecks. To the best of our knowledge, this is a previously unstudied result in the context of graph representation learning, and we believe our analysis paves the way to a novel class of scalable methods to counter oversquashing in GNNs.


From graph cuts to isoperimetric inequalities: Convergence rates of Cheeger cuts on data clouds

arXiv.org Machine Learning

In this work we study statistical properties of graph-based clustering algorithms that rely on the optimization of balanced graph cuts, the main example being the optimization of Cheeger cuts. We consider proximity graphs built from data sampled from an underlying distribution supported on a generic smooth compact manifold $M$. In this setting, we obtain high probability convergence rates for both the Cheeger constant and the associated Cheeger cuts towards their continuum counterparts. The key technical tools are careful estimates of interpolation operators which lift empirical Cheeger cuts to the continuum, as well as continuum stability estimates for isoperimetric problems. To our knowledge the quantitative estimates obtained here are the first of their kind.


Non asymptotic bounds in asynchronous sum-weight gossip protocols

arXiv.org Machine Learning

This paper focuses on non-asymptotic diffusion time in asynchronous gossip protocols. Asynchronous gossip protocols are designed to perform distributed computation in a network of nodes by randomly exchanging messages on the associated graph. To achieve consensus among nodes, a minimal number of messages has to be exchanged. We provides a probabilistic bound to such number for the general case. We provide a explicit formula for fully connected graphs depending only on the number of nodes and an approximation for any graph depending on the spectrum of the graph.


Sampling for Bayesian Mixture Models: MCMC with Polynomial-Time Mixing

arXiv.org Machine Learning

Various researchers have studied posterior inference of parameters in Bayesian mixture models [24, 42, 23], so that the statistical behavior of such models is relatively well-understood. In contrast, much less is known about the efficiency of different algorithms for sampling from the posterior distributions that arise from Bayesian mixture models. A standard approach for doing so is via some form of Markov Chain Monte Carlo (MCMC). Many different types of MCMC algorithms have been introduced for various types of Bayesian mixture models, including finite Bayesian mixture models [21, 49, 50, 26, 40], Dirichlet process mixture models [37, 41, 25, 28], and hierarchical and nested Dirichlet process models [52, 47]. Despite the plethora of possible MCMC methods, upper bounds on their mixing times are often challenging to establish. We refer the reader to the papers [27, 3, 55, 48, 57] for non-asymptotic upper bounds on mixing times for certain types of Bayesian models, different from those studied in this paper. In recent years, it has been increasingly common in the Bayesian literature to make use of a fractional likelihood--meaning an ordinary likelihood raised to some fractional power. Combining such a fractional likelihood with a prior distribution in the usual way leads to a class of posteriors known as power posterior or fractional posterior distributions. The power posterior distributions have been shown to have attractive properties in terms of robustness to mis-specification in Bayesian mixture models [39], and have been used in various applications 1 arXiv:1912.05153v1


Nonconvex sampling with the Metropolis-adjusted Langevin algorithm

arXiv.org Machine Learning

The Langevin Markov chain algorithms are widely deployed methods to sample from distributions in challenging high-dimensional and non-convex statistics and machine learning applications. Despite this, current bounds for the Langevin algorithms are slower than those of competing algorithms in many important situations, for instance when sampling from weakly log-concave distributions, or when sampling or optimizing non-convex log-densities. In this paper, we obtain improved bounds in many of these situations, showing that the Metropolis-adjusted Langevin algorithm (MALA) is faster than the best bounds for its competitor algorithms when the target distribution satisfies weak third- and fourth- order regularity properties associated with the input data. In many settings, our regularity conditions are weaker than the usual Euclidean operator norm regularity properties, allowing us to show faster bounds for a much larger class of distributions than would be possible with the usual Euclidean operator norm approach, including in statistics and machine learning applications where the data satisfy a certain incoherence condition. In particular, we show that using our regularity conditions one can obtain faster bounds for applications which include sampling problems in Bayesian logistic regression with weakly convex priors, and the nonconvex optimization problem of learning linear classifiers with zero-one loss functions. Our main technical contribution in this paper is our analysis of the Metropolis acceptance probability of MALA in terms of its "energy-conservation error," and our bound for this error in terms of third- and fourth- order regularity conditions. Our combination of this higher-order analysis of the energy conservation error with the conductance method is key to obtaining bounds which have a sub-linear dependence on the dimension $d$ in the non-strongly logconcave setting.


Convex Optimization with Nonconvex Oracles

arXiv.org Machine Learning

In machine learning and optimization, one often wants to minimize a convex objective function $F$ but can only evaluate a noisy approximation $\hat{F}$ to it. Even though $F$ is convex, the noise may render $\hat{F}$ nonconvex, making the task of minimizing $F$ intractable in general. As a consequence, several works in theoretical computer science, machine learning and optimization have focused on coming up with polynomial time algorithms to minimize $F$ under conditions on the noise $F(x)-\hat{F}(x)$ such as its uniform-boundedness, or on $F$ such as strong convexity. However, in many applications of interest, these conditions do not hold. Here we show that, if the noise has magnitude $\alpha F(x) + \beta$ for some $\alpha, \beta > 0$, then there is a polynomial time algorithm to find an approximate minimizer of $F$. In particular, our result allows for unbounded noise and generalizes those of Applegate and Kannan, and Zhang, Liang and Charikar, who proved similar results for the bounded noise case, and that of Belloni et al. who assume that the noise grows in a very specific manner and that $F$ is strongly convex. Turning our result on its head, one may also view our algorithm as minimizing a nonconvex function $\hat{F}$ that is promised to be related to a convex function $F$ as above. Our algorithm is a "simulated annealing" modification of the stochastic gradient Langevin Markov chain and gradually decreases the temperature of the chain to approach the global minimizer. Analyzing such an algorithm for the unbounded noise model and a general convex function turns out to be challenging and requires several technical ideas that might be of independent interest in deriving non-asymptotic bounds for other simulated annealing based algorithms.


A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics

arXiv.org Machine Learning

We study the Stochastic Gradient Langevin Dynamics (SGLD) algorithm for non-convex optimization. The algorithm performs stochastic gradient descent, where in each step it injects appropriately scaled Gaussian noise to the update. We analyze the algorithm's hitting time to an arbitrary subset of the parameter space. Two results follow from our general theory: First, we prove that for empirical risk minimization, if the empirical risk is point-wise close to the (smooth) population risk, then the algorithm achieves an approximate local minimum of the population risk in polynomial time, escaping suboptimal local minima that only exist in the empirical risk. Second, we show that SGLD improves on one of the best known learnability results for learning linear classifiers under the zero-one loss.